Using R to Build a Community Data Explorer for Cincinnati (CoDEC)

CCHMC R Users Group

Cole Brokamp, Erika Manning, Andrew Vancil

5/10/23

Welcome

Join the RUG Outlook group for updates and events.

Upcoming BUG + RUG Event

Using R to Build a Community Data Explorer for Cincinnati (CoDEC)

  1. Introduction to CoDEC
  2. Sharing CoDEC Data
  3. Exploring CoDEC Data

Background

The White House’s Equitable Data Working Group:

  • Equitable data are “those that allow for rigorous assessment of the extent to which government programs and policies yield consistently fair, just, and impartial treatment of all individuals.”
  • Equitable data should “illuminate opportunities for targeted actions that will result in demonstrably improved outcomes for underserved communities.”
  • Make disaggregated data the norm while being “… intentional about when data are collected and shared, as well as how data are protected so as not to exacerbate the vulnerability of members of underserved communities, many of whom face the heightened risk of harm if their privacy is not protected.”

Disaggregation

  • Open data can fall short of driving action if it is not equitable.

  • Disaggregating data by sensitive attributes, like race and ethnicity, can elucidate inequities that would otherwise remain hidden.

Open data is necessary and not sufficient to drive the type of action that we need to create a more equitable society.

— The U.S. Chief Data Scientist, Denice Ross

Privacy

  • Data are people
  • Privacy is a spectrum of tradeoffs between risks and benefits to individuals and populations
  • Data collected at the individual level by one organization often cannot be shared with another organization because of legal restrictions or organization-specific data governance policies
  • Community-level (e.g. neighborhood, census tract, ZIP code) data disaggregated by gender, race, or other sensitive attributes
  • Achieving data harmonization upstream of storage allows for contribution of disaggregated, community-level data without disclosing individual-level data when sharing across organizations
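As a minimal sketch of the last two points, the base R code below aggregates individual-level records to counts by census tract and race before anything is shared. The data frame, its column names (`person_id`, `race`, `had_event`), and all values are made up for illustration; this is not the CoDEC pipeline itself.

```r
# Hypothetical individual-level records; every value here is made up.
records <- data.frame(
  person_id = 1:8,
  census_tract_id_2020 = c("39061000200", "39061000200", "39061000200",
                           "39061000700", "39061000700", "39061000700",
                           "39061000700", "39061000900"),
  race = c("Black", "White", "Black", "White", "White", "Black", "Black", "White"),
  had_event = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
)

# Helper columns so that summing gives a denominator and a numerator.
records$n_people <- 1L
records$n_events <- as.integer(records$had_event)

# One row per tract and race; person_id is not carried into the result.
tract_level <- aggregate(
  cbind(n_people, n_events) ~ census_tract_id_2020 + race,
  data = records,
  FUN = sum
)
tract_level$event_rate <- tract_level$n_events / tract_level$n_people

print(tract_level)
```

Only `tract_level` would be contributed to the shared resource; the individual-level `records` stay with the organization that collected them.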

The TRUST principles for digital repositories

Creating and maintaining an open community-level data resource equips the entire community for data-powered decision making and boosts organizational trustworthiness. Demonstrating the reliability and capability to manage shared data appropriately helps earn the trust of the organizations and communities the resource is intended to serve:

  • 🤲 transparent: make specific repository services and data holdings verifiable by publicly accessible evidence
  • 📃 responsible: ensure authenticity and integrity of data holdings
  • 👥 user-focused: meet data management norms and expectations of target user communities
  • ⏳ sustainable: preserve services and data holdings for the long term
  • ⚙️ technological: provide infrastructure and capabilities supporting secure, persistent, and reliable services

FAIR

  • 🔎 findable: use a unique and persistent identifier, add rich metadata (using existing standards)
  • 🔓 accessible: store in a data repository (⚠️ personal/classified information, but metadata still accessible)
  • ⚙️ interoperable: use an open file format with controlled vocabularies, reference relevant datasets
  • ♻️ reusable: well documented, including a description (README with data sources, background, and how to reproduce the data), a data dictionary (field descriptions, units, titles, missingness), and usage licenses (for code or data/presentations/papers)

Community Data Explorer for Cincinnati (CoDEC)

A data repository composed of equitable, community-level data for Cincinnati.

Community Data Explorer for Cincinnati (CoDEC)

  • Define common data specification for community-level data
    • builds on Frictionless
    • FAIR, TRUST
    • privacy
    • equitable disaggregation
  • Provide a method and tools for harmonizing, storing, accessing, and sharing community-level data
    • spatiotemporal harmonization for integration with other data
    • data catalog
    • API for accessing data at scale and on demand

Using these tools, a collection of extant community-level data resources is automatically transformed into a harmonized, community-level tabular data package that is openly available and accompanied by:

  • a richly-documented data catalog
  • a web-based interface for exploring and learning from data
  • an API for accessing data at scale and on demand

CoDEC Overview

%%{init: { "fontFamily": "arial" } }%%

flowchart LR

classDef I fill:#E49865,stroke:#333,stroke-width:0px;
classDef II fill:#EACEC5,stroke:#333,stroke-width:0px;
classDef III fill:#CBD6D5,stroke:#333,stroke-width:0px;
classDef IIII fill:#8CB4C3,stroke:#333,stroke-width:0px;
classDef V fill:#396175,color:#F6EAD8,stroke:#333,stroke-width:0px;

subgraph source-box [data sources]
    org(community \norganization):::I
    jfs(government \n organization):::I
    cchmc("healthcare \n organization"):::I
    acs("built, natural, and \n social environment"):::I
end
class source-box II

stage(collection of community-\nlevel data):::I

org --> |"data \n support"| stage
jfs --> |decentralized \n geocoding| stage
cchmc --> |spatiotemporal \n aggregation| stage
acs --> |automatic \n interpolation| stage
stage --> codec-box

subgraph codec-box ["Community Data Explorer for Cincinnati (CoDEC)"]
    ingest("(meta)data harmonization"):::IIII
    data(community-level \n tabular data resource):::IIII
    data-catalog("interactive data catalog\n geomarker.io/codec"):::IIII
    ingest --> data
    data --> data-catalog
    data --> api(data API):::IIII
    api --> bindings(R code \n for accessing data):::IIII
    data-catalog --> download(explore, map, download):::V
end

class codec-box III

bindings --> dashboard("dashboards and reports"):::V
bindings --> qr(QI & research):::V
api ---> anywhere(public access):::V

Data Harmonization

  • CoDEC encodes data streams about the communities in which we live into a common format (census tract and month) so that they can be decoded into different community-level geographies and different time frames.
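To illustrate the "decoding" step in time, the sketch below collapses tract-by-month counts into tract-by-year counts with base R. The table and its `n_events` column are hypothetical, and the approach assumes the measure is a count that can be summed; a rate would instead need to be recomputed from its numerator and denominator.

```r
# Hypothetical tract-by-month counts in the common (tract, month) format.
monthly <- data.frame(
  census_tract_id_2020 = rep(c("39061000200", "39061000700"), each = 4),
  year  = rep(c(2022L, 2022L, 2023L, 2023L), times = 2),
  month = rep(c(11L, 12L, 1L, 2L), times = 2),
  n_events = c(3, 5, 2, 4, 10, 0, 1, 6)
)

# Decode to a coarser time frame: one row per tract and year.
yearly <- aggregate(
  n_events ~ census_tract_id_2020 + year,
  data = monthly,
  FUN = sum
)

print(yearly)
```

The same pattern applies to other time frames (for example, quarters) by deriving the grouping column from `year` and `month` first.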

CoDEC Data Available Now

How to Read Data in R Using CoDEC

codec::codec_data("hamilton_property_code_enforcement")
# A tibble: 226 × 3
   census_tract_id_2020 violations_per_household  year
   <chr>                                   <dbl> <int>
 1 39061000200                             0.328  2022
 2 39061000700                             0.647  2022
 3 39061000900                             2.65   2022
 4 39061001000                             1.38   2022
 5 39061001100                             2.01   2022
 6 39061001600                             5.06   2022
 7 39061001700                             5.43   2022
 8 39061001800                             3.09   2022
 9 39061001900                             1.32   2022
10 39061002000                             0.957  2022
# ℹ 216 more rows
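Because the result is an ordinary tibble, it can be sorted and summarized with standard R tools. The sketch below retypes the first five printed rows by hand so that it runs without the {codec} package installed; with the package available, `d` would instead come from the `codec::codec_data()` call shown above. The `above_median` column is an illustrative addition.

```r
# First five rows of the printed output, entered by hand for illustration.
d <- data.frame(
  census_tract_id_2020 = c("39061000200", "39061000700", "39061000900",
                           "39061001000", "39061001100"),
  violations_per_household = c(0.328, 0.647, 2.65, 1.38, 2.01),
  year = 2022L
)

# Sort tracts from the highest to the lowest rate.
ranked <- d[order(d$violations_per_household, decreasing = TRUE), ]

# Flag tracts above the median of these five rows only.
ranked$above_median <-
  ranked$violations_per_household > median(d$violations_per_household)

print(ranked)
```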

How to Read Metadata in R Using CoDEC

codec::codec_data("hamilton_property_code_enforcement") |>
  codec::glimpse_tdr()
$attributes
# A tibble: 7 × 2
  name        value                                                             
  <chr>       <chr>                                                             
1 profile     tabular-data-resource                                             
2 name        hamilton_property_code_enforcement                                
3 path        hamilton_property_code_enforcement.csv                            
4 version     0.1.2                                                             
5 title       Hamilton County Property Code Enforcement                         
6 homepage    https://geomarker.io/hamilton_property_code_enforcement           
7 description Number of property code enforcements per household by census tract

$schema
# A tibble: 3 × 4
  name                     description                               type  title
  <chr>                    <chr>                                     <chr> <chr>
1 census_tract_id_2020     census tract identifier                   stri… <NA> 
2 violations_per_household number of property code enforcements per… numb… <NA> 
3 year                     data year                                 inte… Year 

Sharing CoDEC Data

Frictionless Standards

Developed by the Open Knowledge Foundation, the Frictionless standards are a set of patterns for describing data, including datasets (Data Package), files (Data Resource), and tables (Table Schema). A Data Package is a simple container format used to describe and package a collection of data and metadata, including schemas. The metadata live in a descriptor file, separate from the data file and usually written in JSON or YAML. Each Frictionless standard describes something specific:

  • Table Schema: describes a tabular file by providing its dimensions, field data types, relations, and constraints
  • Data Resource: describes one specific tabular file by providing a path to the file and details such as its title and description
  • Tabular Data Resource = Data Resource + Table Schema
  • CSV dialect: describes the formatting specific to the various dialects of CSV files
  • Data Package & Tabular Data Package: describes a collection of tabular files by providing the data resource information above along with general information about the package itself, such as a license, authors, and other metadata
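To make "Tabular Data Resource = Data Resource + Table Schema" concrete, the sketch below builds a descriptor as a nested R list. The resource name, path, field names, and descriptions are hypothetical; the property names (`profile`, `name`, `path`, `schema`, `fields`, `primaryKey`) follow the Frictionless standards. Serializing to JSON uses {jsonlite} only if it happens to be installed.

```r
# Hypothetical descriptor for a made-up resource.
descriptor <- list(
  # Data Resource part: where the file is and what it is called.
  profile = "tabular-data-resource",
  name = "example_tract_measure",
  path = "example_tract_measure.csv",
  title = "Example Tract Measure",
  description = "A made-up measure by census tract and year",
  # Table Schema part: fields, their types, and the key.
  schema = list(
    fields = list(
      list(name = "census_tract_id_2020", type = "string",
           description = "census tract identifier"),
      list(name = "year", type = "integer", description = "data year"),
      list(name = "measure", type = "number", description = "made-up measure")
    ),
    primaryKey = c("census_tract_id_2020", "year")
  )
)

# Pull the field names back out of the schema.
field_names <- vapply(descriptor$schema$fields, function(f) f$name, character(1))
print(field_names)

# Optional: show the descriptor as JSON when {jsonlite} is available.
if (requireNamespace("jsonlite", quietly = TRUE)) {
  cat(jsonlite::toJSON(descriptor, auto_unbox = TRUE, pretty = TRUE), "\n")
}
```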

CoDEC Specifications

%%{init: { "fontFamily": "Arial" } }%%

flowchart TB

classDef I fill:#E49865,stroke:#333,stroke-width:2px;
classDef II fill:#EACEC5,stroke:#333,stroke-width:2px;
classDef III fill:#CBD6D5,stroke:#333,stroke-width:2px;
classDef IIII fill:#8CB4C3,stroke:#333,stroke-width:2px;

tdr([tabular-data-resource]):::I

name(name):::II
path(path):::II
version(version):::II   
schema([schema]):::II
title(title):::II
homepage(homepage):::II
description(description):::II

tdr --- name
tdr --- path
tdr --- version   
tdr --- title
tdr --- description
tdr --- homepage
tdr --- schema

schema --- fields([fields]):::III
schema --- primaryKey(primaryKey):::III
schema --- foreignKey(foreignKey):::III

fields --- field_name_1(field_1:\nname \n title \n description \n type):::IIII
fields --- field_name_2(field_2:\nname \n title \n type \n constraints):::IIII
fields --- field_name_3(field_3:\nname \n title \n description \n type \n constraints):::IIII

https://geomarker.io/codec/articles/specs
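The diagram above can be read as a checklist. The sketch below tests a descriptor (as a nested R list) against the property names shown in the diagram. Both helper functions are hypothetical and are not part of {codec}; treating every property in the diagram as required, and requiring `name` and `type` on each field, are assumptions of this sketch rather than statements of the actual specifications linked above.

```r
# Top-level properties shown in the diagram (assumed required here).
tdr_properties <- c("name", "path", "version", "title",
                    "homepage", "description", "schema")

# Hypothetical helper: which top-level properties are absent?
missing_tdr_properties <- function(descriptor) {
  setdiff(tdr_properties, names(descriptor))
}

# Hypothetical helper: positions of fields lacking a name or a type.
incomplete_fields <- function(descriptor) {
  ok <- vapply(
    descriptor$schema$fields,
    function(f) all(c("name", "type") %in% names(f)),
    logical(1)
  )
  which(!ok)
}

# A deliberately incomplete, made-up descriptor.
example <- list(
  name = "example_tract_measure",
  path = "example_tract_measure.csv",
  title = "Example Tract Measure",
  schema = list(fields = list(
    list(name = "census_tract_id_2020", type = "string"),
    list(name = "year")  # no type given
  ))
)

print(missing_tdr_properties(example))
print(incomplete_fields(example))
```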

{cincy}

  • CoDEC relies on the {cincy} R package to define Cincinnati-area geographies and interpolate area-level data between census tracts, neighborhoods, and ZIP codes in different years.
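The general idea behind interpolating between geographies can be sketched with a weights table. The example below moves tract-level household counts to ZIP codes using made-up weights; the tract labels, the weights, and the column names are all hypothetical, and this illustrates weight-based interpolation in base R rather than the {cincy} implementation or its API.

```r
# Hypothetical tract-level counts.
tracts <- data.frame(
  census_tract_id_2020 = c("A", "B", "C"),
  n_households = c(1000, 400, 600)
)

# Hypothetical crosswalk: share of each tract's households in each ZIP code.
# Weights for a given tract sum to 1, so totals are preserved.
crosswalk <- data.frame(
  census_tract_id_2020 = c("A", "A", "B", "C"),
  zip_code = c("45202", "45203", "45203", "45202"),
  weight = c(0.75, 0.25, 1, 1)
)

# Split each tract's count across ZIP codes, then sum within ZIP code.
merged <- merge(tracts, crosswalk, by = "census_tract_id_2020")
merged$weighted <- merged$n_households * merged$weight
zip_level <- aggregate(weighted ~ zip_code, data = merged, FUN = sum)
names(zip_level)[2] <- "n_households"

print(zip_level)
```

Tract "A" is split between two ZIP codes, while "B" and "C" each fall entirely within one, so the overall household total is the same before and after interpolation.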

CoDEC

The goal of the R package {codec} is to support CoDEC data infrastructure through:

  • curating metadata for tabular data in R: vignette("curating-metadata")
  • reading and writing tabular-data-resources: vignette("reading-writing-tdr")
  • defining the CoDEC tabular-data-resource specifications: vignette("specs")
  • providing tools to check CoDEC tabular-data-resources and create an interactive data catalog: vignette("data")

Curating metadata for tabular data in R using attributes
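A minimal sketch of the idea using only base R attributes: table-level metadata are attached to the data frame, and field-level metadata are attached to individual columns. The data and metadata values are made up; the {codec} helpers for this task are described in vignette("curating-metadata") and are not reproduced here.

```r
# Hypothetical two-row table.
d <- data.frame(
  census_tract_id_2020 = c("39061000200", "39061000700"),
  year = c(2022L, 2022L)
)

# Table-level metadata stored as attributes on the data frame.
attr(d, "title") <- "Example Tract Measure"
attr(d, "description") <- "A made-up measure by census tract and year"

# Field-level metadata stored as attributes on the column itself.
attr(d$year, "title") <- "Year"
attr(d$year, "description") <- "data year"

# Read the metadata back; exact = TRUE avoids partial name matching.
print(attr(d, "title", exact = TRUE))
print(attr(d$year, "title", exact = TRUE))
```

Keeping metadata on the object itself means data and documentation travel together within an R session until they are written out as a data file and a descriptor.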

Reading and writing tabular data resources

Tools for Checking Against CoDEC Specifications

Exploring CoDEC

Leveraging data standards for Shiny?

Screenshot

Shiny

Inset panel and scatterplot

bslib layout

“crosstalk” hack

Interactive Demo

Conclusions

R …

  • curate data, metadata, visualizations all in one language

Thank You

🌐 https://geomarker.io/codec

💻 github.com/geomarker-io